Daily Weather Data Analysis and Classification

In this article, we analyze a weather dataset from Kaggle.com.

Daily Weather Dataset

Data description from Kaggle:

Loading the Dataset
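A minimal sketch of the loading step, assuming the Kaggle data is a CSV file read with pandas. The helper `load_weather`, the inline sample, and the raw column names (`air_pressure_9am` and so on) are illustrative assumptions; in practice `source` would be the path to the downloaded file.

```python
import io

import pandas as pd

# Tiny inline stand-in for the Kaggle CSV; the column names are assumed.
SAMPLE_CSV = """number,air_pressure_9am,air_temp_9am,relative_humidity_9am,relative_humidity_3pm
0,918.06,74.82,42.42,36.16
1,917.35,71.40,24.33,19.43
2,923.04,60.64,8.90,14.46
"""


def load_weather(source):
    """Read the weather table from a path or a file-like object."""
    df = pd.read_csv(source)
    # Drop a row-counter column if present, since it carries no information.
    return df.drop(columns=["number"], errors="ignore")


data = load_weather(io.StringIO(SAMPLE_CSV))
print(data.shape)
print(data.head())
```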

Features

| Column | Description |
|---|---|
| Air Pressure | Air pressure in hectopascals (1 hPa = 100 pascals) at 9 AM |
| Air Temperature | Air temperature in degrees Fahrenheit at 9 AM |
| Avg Wind Direction | Average wind direction over the minute before the timestamp, in degrees (0 is north), at 9 AM |
| Avg Wind Speed | Average wind speed over the minute before the timestamp, in meters per second (m/s), at 9 AM |
| Max Wind Direction | Direction of the highest wind speed in the minute before the timestamp, in degrees (0 is north), at 9 AM |
| Max Wind Speed | Highest wind speed in the minute before the timestamp, in meters per second (m/s), at 9 AM |
| Min Wind Speed | Lowest wind speed in the minute before the timestamp, in meters per second (m/s), at 9 AM |
| Rain Accumulation | Accumulated rain in millimeters (mm) at 9 AM |
| Rain Duration | Length of time it rained, in seconds (s), at 9 AM |
| Relative Humidity (Morning) | Relative humidity in percent at 9 AM |
| Relative Humidity (Afternoon) | Relative humidity in percent at 3 PM |

For convenience, we would like to modify the feature names.
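One way to do the renaming is with an explicit mapping passed to `DataFrame.rename`. The raw names on the left of `RENAME_MAP` are assumptions about how the CSV labels its columns; the names on the right are the ones used in the table above.

```python
import pandas as pd

# Assumed raw column names -> readable feature names from the table above.
RENAME_MAP = {
    "air_pressure_9am": "Air Pressure",
    "air_temp_9am": "Air Temperature",
    "avg_wind_direction_9am": "Avg Wind Direction",
    "avg_wind_speed_9am": "Avg Wind Speed",
    "max_wind_direction_9am": "Max Wind Direction",
    "max_wind_speed_9am": "Max Wind Speed",
    "min_wind_speed_9am": "Min Wind Speed",
    "rain_accumulation_9am": "Rain Accumulation",
    "rain_duration_9am": "Rain Duration",
    "relative_humidity_9am": "Relative Humidity (Morning)",
    "relative_humidity_3pm": "Relative Humidity (Afternoon)",
}

# Two-column toy frame; columns missing from the map are left unchanged.
raw = pd.DataFrame({"air_pressure_9am": [918.06], "relative_humidity_3pm": [36.16]})
renamed = raw.rename(columns=RENAME_MAP)
print(list(renamed.columns))
```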

Preprocessing

Imputing Missing Values
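A minimal sketch of one imputation approach, assuming missing entries are replaced by the column mean with scikit-learn's `SimpleImputer`. The mean strategy and the toy values are assumptions; a median or a model-based imputer would follow the same pattern.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame with one missing value per column.
df = pd.DataFrame({
    "Air Pressure": [918.0, np.nan, 923.0],
    "Air Temperature": [74.8, 71.4, np.nan],
})
print(df.isna().sum())  # number of missing values per column

# Replace each missing value with the mean of its column (assumed strategy).
imputer = SimpleImputer(strategy="mean")
imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns, index=df.index)
print(imputed)
```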

Note that

Problem Description

Let's set Relative Humidity (Afternoon) as the target variable. That is, given the rest of the features, we would like to predict how humid it is at 3 PM. To do so, we define a Humidity Level (Afternoon) feature as follows:

$$\text{Humidity Level (Afternoon)} = \begin{cases} 0 &\mbox{Very Low} \\ 1 &\mbox{Low} \\ 2 &\mbox{Medium} \\ 3 &\mbox{High} \end{cases}$$
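The four levels can be derived by binning the afternoon humidity, for example with `pandas.cut`. The cut points are not stated here, so the equal-width edges in `BIN_EDGES` (25, 50, and 75 percent) are purely an assumption for illustration.

```python
import pandas as pd

humidity = pd.Series([12.0, 30.5, 61.0, 88.2], name="Relative Humidity (Afternoon)")

# Assumed bin edges in percent; the actual thresholds may differ.
BIN_EDGES = [0, 25, 50, 75, 100]
LEVEL_NAMES = {0: "Very Low", 1: "Low", 2: "Medium", 3: "High"}

# Map each humidity value to an integer level 0..3.
levels = pd.cut(
    humidity, bins=BIN_EDGES, labels=[0, 1, 2, 3], include_lowest=True
).astype(int)
levels.name = "Humidity Level (Afternoon)"
print(levels.map(LEVEL_NAMES).tolist())
```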

Variance of the Features

Next, let's look at the variance of our dataset features.
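As a sketch, the per-feature variance can be read off with `DataFrame.var`, which uses the sample variance (`ddof=1`) by default. The toy values below are made up to show how features on different scales produce very different variances.

```python
import pandas as pd

df = pd.DataFrame({
    "Air Pressure": [918.0, 920.0, 922.0],
    "Rain Duration": [0.0, 600.0, 1200.0],
})

# Sample variance of each feature, largest first.
variances = df.var().sort_values(ascending=False)
print(variances)
```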

Furthermore, we would like to standardize features by removing the mean and scaling to unit variance. In this article, we demonstrated the benefits of scaling data using StandardScaler().
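A minimal sketch of the standardization step with `StandardScaler`, using made-up values. After the transform, each column has zero mean and unit (population) variance.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "Air Pressure": [918.0, 920.0, 922.0],
    "Rain Duration": [0.0, 600.0, 1200.0],
})

# Fit the scaler on the features and transform them in one step.
scaler = StandardScaler()
scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns, index=df.index)

print(scaled.mean().round(6))       # approximately 0 for every feature
print(scaled.std(ddof=0).round(6))  # approximately 1 for every feature
```

When a separate test set is used, the scaler would normally be fitted on the training set only and then applied to the test set with `scaler.transform`.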

We can visualize the data using Parallel Coordinates.
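A sketch of such a plot using `pandas.plotting.parallel_coordinates`, with a few made-up standardized rows colored by humidity level. The plotting library is an assumption; other tools offer an equivalent chart.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd
from pandas.plotting import parallel_coordinates

# Made-up standardized features plus the class column used for coloring.
df = pd.DataFrame({
    "Air Pressure": [-1.2, 0.0, 1.2, 0.3],
    "Air Temperature": [0.5, -1.0, 0.8, -0.3],
    "Avg Wind Speed": [1.1, -0.4, -0.9, 0.2],
    "Humidity Level (Afternoon)": [0, 1, 2, 3],
})

fig, ax = plt.subplots(figsize=(8, 4))
# Each row becomes one polyline across the feature axes.
parallel_coordinates(df, class_column="Humidity Level (Afternoon)", ax=ax, colormap="viridis")
ax.set_ylabel("Standardized value")
fig.tight_layout()
```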

However, the results of this visualization can be improved if a clustering method is used. For this reason, we use the K-Means clustering method.
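A minimal sketch of the clustering step with scikit-learn's `KMeans` on synthetic data. The number of clusters is an assumption, as is the idea of plotting the cluster centers in place of every row to declutter the parallel-coordinates chart.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two well-separated synthetic blobs standing in for the standardized features.
X = np.vstack([
    rng.normal(-2.0, 0.3, size=(20, 3)),
    rng.normal(2.0, 0.3, size=(20, 3)),
])

# n_clusters is an assumed value chosen to match the synthetic data.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

# One representative row per cluster, suitable for a less cluttered plot.
centers = kmeans.cluster_centers_
print(centers.round(2))
```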

Modeling and Classification

Train and Test sets
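A sketch of the split using `train_test_split`, assuming a 70/30 split stratified on the humidity level so that each class keeps the same proportion in both sets. The split ratio, the seed, and the synthetic arrays are assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins: 40 samples, 2 features, 4 balanced classes.
X = np.arange(80, dtype=float).reshape(40, 2)
y = np.repeat([0, 1, 2, 3], 10)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)
print(X_train.shape, X_test.shape)
print(np.bincount(y_train), np.bincount(y_test))
```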

Multi-layer Perceptron (MLP) for Multi-Class classification

Here, we implement a Multi-layer Perceptron (MLP) for Multi-Class classification using Keras. For more details, see Deep Learning from Note.
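A minimal sketch of such a network, assuming ten input features and the four humidity levels as classes. The layer sizes, optimizer, epoch count, and random training data are illustrative assumptions, not the configuration used for the results in this article.

```python
import numpy as np
from tensorflow import keras

n_features, n_classes = 10, 4  # ten input features, four humidity levels

# Random stand-in data; real inputs would be the standardized features.
rng = np.random.default_rng(0)
X = rng.normal(size=(64, n_features)).astype("float32")
y = rng.integers(0, n_classes, size=64)

model = keras.Sequential([
    keras.Input(shape=(n_features,)),
    keras.layers.Dense(32, activation="relu"),
    keras.layers.Dense(16, activation="relu"),
    # Softmax output: one probability per humidity level.
    keras.layers.Dense(n_classes, activation="softmax"),
])
# Integer class labels pair with the sparse categorical cross-entropy loss.
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])

history = model.fit(X, y, epochs=2, batch_size=16, validation_split=0.25, verbose=0)
probs = model.predict(X[:5], verbose=0)
print(probs.argmax(axis=1))  # predicted level for the first five rows
```

The `history.history` dictionary holds the per-epoch training and validation loss and accuracy, which is the usual input for optimization plots.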

Model Optimization Plots

Confusion Matrix
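A sketch of how a confusion matrix can be computed with scikit-learn, using made-up true and predicted humidity levels. Rows correspond to true classes and columns to predicted classes.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up labels for the four humidity levels (0 = Very Low ... 3 = High).
y_true = np.array([0, 0, 1, 1, 2, 2, 3, 3])
y_pred = np.array([0, 1, 1, 1, 2, 3, 3, 3])

cm = confusion_matrix(y_true, y_pred, labels=[0, 1, 2, 3])
print(cm)

# Overall accuracy is the sum of the diagonal divided by the sample count.
accuracy = np.trace(cm) / cm.sum()
print(accuracy)
```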


References

  1. Kaggle Weather Data
  2. Keras developer guides
  3. Multilayer perceptron, Wikipedia
  4. Confusion matrix, Wikipedia